mtmd: qwen3 audio support (qwen3-omni and qwen3-asr)#19441
ngxson wants to merge 9 commits into ggml-org:master from
Conversation
Any updates on this?
Any updates?
Add support for Qwen3-ASR-1.7B model (Qwen3ASRForConditionalGeneration):
- New QWEN3A projector type for audio-only ASR models
- Conv2d encoder (3 layers, stride=2 each, 8x time downsampling)
- Whisper-like transformer encoder (24 layers)
- MLP projector: Linear(1024, 1024) -> GELU -> Linear(1024, 2048)
- Conversion tested: both mmproj and decoder GGUF files work
- Basic inference tested: model loads, encodes audio, generates output

Based on PR ggml-org#19441 by ngxson (WIP qwen3 audio), adapted for the Qwen3-ASR-only architecture (no vision, no deepstack). Our attention extraction API (llama_set_attn_heads/llama_get_attn_ith) is untouched.
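The "8x time downsampling" follows directly from the three stride-2 conv layers. A minimal sketch of the length arithmetic (kernel and padding values here are illustrative assumptions, not taken from the model config):

```python
# Hypothetical helper, not from the PR: output length after the Conv2d
# front-end described above (3 layers, each stride 2 -> 8x downsampling).
def conv_out_len(n_frames: int, kernel: int = 3, stride: int = 2,
                 padding: int = 1, n_layers: int = 3) -> int:
    """Apply the standard conv output-length formula n_layers times."""
    for _ in range(n_layers):
        n_frames = (n_frames + 2 * padding - kernel) // stride + 1
    return n_frames

# e.g. 30 s of audio at 100 mel frames/s -> 3000 frames -> 375 tokens
print(conv_out_len(3000))  # 375
```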
I wrote working Qwen3-ASR support for my own use at https://github.com/michoecho/llama.cpp/commits/qwen3_asr_support. (I successfully used it to transcribe some lectures in Chinese.) I don't know whether it's good enough for upstreaming, because I wasn't thinking about qwen3-omni at all (I have no idea what "deepstack" is), but you could use it as a working base if you are getting wrong results. At a glance, what mainly seems to be missing from this PR is:
By the way, note that Qwen3-ForcedAligner (the timestamp predictor model) has the same architecture as Qwen3-ASR, so if you implement support for the latter, you almost get support for the former too. "Almost" because the ForcedAligner is a non-autoregressive classification model. (You put in the encoded audio and the transcribed text with some
Both qwen3-omni and qwen3-asr are working with this PR; GGUFs will be uploaded shortly.
Chunking can be implemented in a follow-up PR; for simplicity, this PR processes the input as a single 30s chunk.
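For reference, the follow-up chunking could be as simple as slicing the PCM buffer into fixed windows. This is a sketch of the idea, not code from the PR; the 16 kHz sample rate is an assumption (it is what Whisper-style encoders typically expect):

```python
# Hypothetical sketch of 30 s chunking for a mono PCM buffer.
def chunk_audio(samples: list, sample_rate: int = 16000,
                chunk_seconds: int = 30) -> list:
    """Split samples into consecutive chunks of at most chunk_seconds."""
    step = sample_rate * chunk_seconds
    return [samples[i:i + step] for i in range(0, len(samples), step)]

# 70 s of audio -> two full 30 s chunks plus a 10 s remainder
chunks = chunk_audio([0.0] * (16000 * 70))
print([len(c) // 16000 for c in chunks])  # [30, 30, 10]
```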
Thanks for pointing that out; it needs to be fixed in this PR.
That was fixed by simply pushing a chatml jinja template into the GGUF upon conversion.
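For context, "pushing a chatml jinja template" means writing the template string into the GGUF metadata at conversion time; the gguf-py writer exposes `add_chat_template` for this. The template below is a generic ChatML example, not necessarily the exact string this PR embeds:

```python
# Generic ChatML Jinja template (illustrative; the PR's exact template
# may differ).
CHATML_TEMPLATE = (
    "{% for message in messages %}"
    "<|im_start|>{{ message['role'] }}\n{{ message['content'] }}<|im_end|>\n"
    "{% endfor %}"
    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
)

# During conversion (assuming a gguf.GGUFWriter instance `writer`):
#     writer.add_chat_template(CHATML_TEMPLATE)
print("<|im_start|>" in CHATML_TEMPLATE)  # True
```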
Hmm, yeah, that sounds complicated; I'll see whether it's worth implementing, given that another model (voxtral from mistral) has somewhat similar logic.
Suggested change:

```diff
-return []
-return [(self.map_tensor_name(name), data_torch)]
-return []  # skip other tensors
+return
+yield from super().modify_tensors(data_torch, name, bid)
+return  # skip other tensors
```
Nice catch; yeah, this code was written before the `yield from` refactor 😅
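To illustrate why the review suggestion matters: once `modify_tensors` became a generator (after the `yield from` refactor), `return []` is misleading, because a generator's return value is discarded; only a bare `return` (or `yield from ...`) affects what the caller iterates over. A standalone demonstration:

```python
# The returned list inside a generator is silently discarded.
def returns_list():
    return [("tensor", 42)]  # discarded: generators ignore return values
    yield                    # unreachable, but makes this a generator function

# Equivalent and idiomatic: a bare return just stops iteration.
def bare_return():
    return
    yield

print(list(returns_list()))  # [] -- NOT [("tensor", 42)]
print(list(bare_return()))   # []
```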
```python
yield from Qwen2VLVisionModel.modify_tensors(self, data_torch, name, bid)
elif "audio_tower." in name:
    yield from Qwen25AudioModel.modify_tensors(self, data_torch, name, bid)
return []  # skip other tensors
```

Suggested change:

```diff
-return []  # skip other tensors
+return  # skip other tensors
```
Suggested change:

```diff
-yield (self.map_tensor_name(name), data_torch)
-return []  # skip other tensors
+yield from super().modify_tensors(data_torch, name, bid)
+return  # skip other tensors
```
```python
if "thinker_config" in self.hparams:
    vision_config = self.hparams["thinker_config"].get("vision_config", {})
else:
    vision_config = self.hparams.get("vision_config", {})
```

Instead of handling this everywhere, can't we just merge in all sub-configs from `thinker_config` here? (`convert_hf_to_gguf.py`, lines 974 to 976 in eefcfee)
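The reviewer's suggestion amounts to flattening `thinker_config` into the top-level hparams once at load time, so downstream code can always read `vision_config` / `audio_config` directly. A minimal sketch (the helper name and the sample config shape are illustrative, not from the codebase):

```python
# Hypothetical helper: hoist thinker_config's sub-configs to the top level.
def merge_thinker_config(hparams: dict) -> dict:
    merged = dict(hparams)
    merged.update(merged.pop("thinker_config", {}))
    return merged

hparams = {"thinker_config": {"vision_config": {"depth": 32},
                              "audio_config": {"encoder_layers": 24}}}
hparams = merge_thinker_config(hparams)
print(hparams.get("vision_config", {}))  # {'depth': 32}
```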
```python
return
if "visual." in name or "audio_tower." in name \
        or "talker." in name or "code2wav." in name:
    return []
```
Status: